GitHub Repository: debakarr/machinelearning
Path: blob/master/Part 10 - Model Selection And Boosting/Grid Search/[R] Grid Search.ipynb
¹³⁴¹ views

Kernel: R

Grid Search

Data preprocessing

In [1]:

# Importing the dataset
dataset = read.csv('Social_Network_Ads.csv')
dataset = dataset[3:5]

In [2]:

# Encoding the target feature as factor
dataset$Purchased = factor(dataset$Purchased, levels = c(0, 1))

In [3]:

# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(1234)
split = sample.split(dataset$Purchased, SplitRatio = 0.80)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

In [4]:

# Feature Scaling
training_set[-3] = scale(training_set[-3])
test_set[-3] = scale(test_set[-3])

Applying Grid Search to find the best model and the best parameters

In [5]:

library(caret)
classifier = train(form = Purchased ~ ., 
                   data = training_set, 
                   method = 'svmRadial')

Out[5]:

Loading required package: lattice
Loading required package: ggplot2

Attaching package: ‘kernlab’

The following object is masked from ‘package:ggplot2’:

    alpha

In [6]:

classifier

Out[6]:

Support Vector Machines with Radial Basis Function Kernel 

320 samples
  2 predictor
  2 classes: '0', '1' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 320, 320, 320, 320, 320, 320, ... 
Resampling results across tuning parameters:

  C     Accuracy   Kappa    
  0.25  0.9058640  0.7969638
  0.50  0.9065979  0.7987936
  1.00  0.9054507  0.7962180

Tuning parameter 'sigma' was held constant at a value of 1.599667
Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were sigma = 1.599667 and C = 0.5.

In [7]:

classifier$bestTune

Out[7]:

Fitting classifier to the Training set

In [8]:

#  install.package('e1071')
library(e1071)
classifier = svm(formula = Purchased ~ ., 
                 data = training_set, 
                 type = 'C-classification', 
                 kernel = 'radial', 
                 C = 0.5,
                 sigma = 1.599667)

Predicting the Test set results

In [9]:

y_pred = predict(classifier, newdata = test_set[-3])

In [10]:

head(y_pred, 10)

Out[10]:

In [11]:

head(test_set[3], 10)

Out[11]:

Applying k-Fold Cross Validation

In [12]:

library(caret)

In [13]:

folds = createFolds(training_set$Purchased, k = 10)
cv = lapply(folds, function(x){
    training_fold = training_set[-x, ]
    test_fold = training_set[x, ]
    classifier = svm(formula = Purchased ~ ., 
                 data = training_fold, 
                 type = 'C-classification', 
                 kernel = 'radial')
    y_fold_pred = predict(classifier, newdata = test_fold[-3])
    cm = table(test_fold[, 3], y_fold_pred)
    accuracy = (cm[1, 1] + cm[2, 2])/(sum(cm))
    return(accuracy)
})

In [14]:

cv

Out[14]:

In [15]:

mean(as.numeric(cv))

Out[15]:

In [16]:

sd(as.numeric(cv))

Out[16]:

This signifies that we are in Low Bias Low Variance category in Bias-Variance TradeOff.

Making the Confusion Matrix

In [17]:

cm = table(test_set[, 3], y_pred)
cm

Out[17]:

   y_pred
1
45  6
4 25

classifier made 45 + 25 = 75 correct prediction and 6 + 4 = 10 incorect predictions.

Visualising the Training set results

In [18]:

library(ElemStatLearn)
set = training_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = predict(classifier, newdata = grid_set)
plot(set[, -3],
     main = 'Kernel SVM (Training set)',
     xlab = 'Age', ylab = 'Estimated Salary',
     xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'), col='white')
legend("topright", legend = c("0", "1"), pch = 16, col = c('red3', 'green4'))

Out[18]:

Visualising the Test set results

In [19]:

set = test_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = predict(classifier, newdata = grid_set)
plot(set[, -3], main = 'Kernel SVM (Test set)',
     xlab = 'Age', ylab = 'Estimated Salary',
     xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'), col='white')
legend("topright", legend = c("0", "1"), pch = 16, col = c('red3', 'green4'))

Out[19]:

Looks like it is much better the Linear kernel.